Credit Card Users Churn Prediction Project

The objective of this project is to find the best model to predict which customers are likely to stop using their credit card and leave the bank's services.

Observations

The 'abc' values in the Income_Category column need to be treated.

EDA

Observations

The above plots show that there are some outliers in the data which should be treated.

Observations

Most of the clients are still clients of the bank.

Observations

Most of the clients are Female.

Observations

Most of the clients have a graduate degree.

Observations

Most of the clients are Married.

Observations

Most of the clients earn less than $40K. As mentioned before, the 'abc' category should be processed to obtain clean data.

Observations

Most of the clients have Blue cards.

Observations

There is a strong correlation between "Months_on_book" and "Customer_Age".

There is a strong correlation between "Total_Trans_Amt" and "Total_Trans_Ct".

There is a correlation between "Avg_Utilization_Ratio" and "Total_Revolving_Bal".

There is also a negative correlation between "Total_Trans_Amt" and "Total_Relationship_Count".

There is also a negative correlation between "Total_Trans_Ct" and "Total_Relationship_Count".

There is also a negative correlation between "Credit_Limit" and "Avg_Utilization_Ratio".

There is also a negative correlation between "Avg_Open_To_Buy" and "Avg_Utilization_Ratio".

Observations

Female customers show a greater tendency to drop the credit card services.

Observations

Customers with higher education show a greater tendency to drop the credit card services.

Observations

Among all customers, married customers seem the most interested in the company's credit card services, but about 50% of those who dropped their credit cards are also married clients.

Observation

Clients in the lowest income category seem to be the least interested in our services.

Observations

Customers with a higher card level show a greater tendency to keep the credit card services.

Feature Engineering

First of all, I would like to treat the 'abc' values in the 'Income_Category' column. To do so, I will replace 'abc' with NaN and then treat those entries as null values.

1112 values were converted to NaN after replacing 'abc'.
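A minimal sketch of this replacement, assuming the data is in a DataFrame `df` with an 'Income_Category' column (the sample rows below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real dataset
df = pd.DataFrame(
    {"Income_Category": ["Less than $40K", "abc", "$40K - $60K", "abc"]}
)

# Replace the placeholder 'abc' with NaN so it is treated as a missing value
df["Income_Category"] = df["Income_Category"].replace("abc", np.nan)

# Count how many entries became NaN
print(df["Income_Category"].isna().sum())
```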

Observation

We can see that many outliers are detected.

We will treat the null values after splitting the data in order to avoid data leakage.
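One way to sketch this leakage-free imputation, on a hypothetical numeric feature matrix (the actual imputation strategy used in the project is not shown in this summary), is to fit the imputer on the training split only:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix with missing values and a toy target
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
y = np.array([0, 1, 0, 1])

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=1
)

# Fit the imputer on the training data only, so no validation
# statistics leak into the preprocessing step
imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)
X_val_imp = imputer.transform(X_val)  # reuse training medians on validation
```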

Data Preparation for Modeling

We would like to predict the clients who intend to stop using their credit card, so we aim to maximize Recall in order to minimize false negatives. A false negative here means that a client who is about to drop the credit card is predicted to keep it.
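As a small illustration of why recall targets false negatives, with hypothetical labels (1 = client drops the card):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true labels and predictions (1 = churner, 0 = stays)
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall = TP / (TP + FN): maximizing recall minimizes missed churners
print(recall_score(y_true, y_pred))  # 2 / (2 + 1)
```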

The 3 best models above are:

1- XGBoost

2- GBM

3- Adaboost

XGBoost Hyperparameter Tuning with Randomized Search CV

It seems that we do not have an overfitting problem, because we get a good recall score on both the training and validation sets, but precision is not in the range we expected.

There is not much difference in Recall between the training and validation data, but precision is not good enough.

GradientBoosting Hyperparameter Tuning with Randomized Search CV

There is a decrease in Recall compared to the XGBoost model, but accuracy and Precision improved.

Adaboost Hyperparameter Tuning with Randomized Search CV

Adaboost has the weakest performance among the 3 models, but still has better precision and accuracy compared to XGBoost.

Models Comparison on Training and Validation Data

The best result considering the Recall score is XGBoost. We still need more improvement to get better results.

Oversampling train data using SMOTE

XGBoost tuned on Oversampled data

Compared to our previous model, Recall improved on both the training and validation sets, but precision decreased.

Gradient Boosting tuned on Oversampled data

Compared to the GBM_tuned model, this model is improved, but in terms of Recall, XGBoost is still better.

Adaboost tuned on Oversampled data

The result is comparable to the GBM_tuned model, but still not as good as XGBoost.

Models Comparison on Training and Validation Data for oversampled data

XGBoost's performance is the best among all the models, and its Recall is slightly better with the oversampled data.

Undersampling train data using Random Undersampler

XGBoost tuned on Undersampled data

Precision gets worse, but we have a very good Recall score on the training and validation data.

Gradient Boosting tuned on Undersampled data

This model looks great compared to all the other models; although its Recall score is slightly lower than XGBoost's, it also has very good precision and accuracy.

Adaboost tuned on Undersampled data

This model also looks fine, but GBM is the stronger model from the scoring point of view.

Models Comparison on Training and Validation Data for undersampled data

So, considering all the methods above, the best model is Gradient Boosting tuned with hyperparameters on the undersampled data.

XGBoost has the best Recall, but its precision and accuracy are not good enough, since precision and accuracy above 0.7 are required.

Fitting model on Test data

We get almost the same result as we got when evaluating the model on the validation set.

Important Parameters
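A sketch of how feature importances can be read off a fitted gradient boosting model; the stand-in data and feature names below are illustrative assumptions, not the project's actual ranking:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical stand-in data; the feature names are illustrative only
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
names = [
    "Total_Trans_Ct",
    "Total_Trans_Amt",
    "Total_Revolving_Bal",
    "Credit_Limit",
    "Customer_Age",
]

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Rank features by their importance in the fitted model
importances = pd.Series(
    model.feature_importances_, index=names
).sort_values(ascending=False)
print(importances)
```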

Conclusion

Considering the results from the EDA and model building, I can suggest the following:

1- The bank should pay attention to clients with lower income and lower card levels, because it seems they do not benefit much from what they get from the bank.

2- Most of the clients are female, and most of the clients who are not interested in keeping their credit cards are also female, which should be taken into account.

3- The clients' credit limit plays an important role in our model: the higher the credit limit, the lower the chance of the client leaving the credit card services.

4- The Marital Status of the clients is also very important for the chance of dropping the services. About 50% of the clients who leave the services are married. The bank could target this category with more offers.